
    Self-Attention Temporal Convolutional Network for Long-Term Daily Living Activity Detection

    Get PDF
    In this paper, we address the detection of daily living activities in long-term untrimmed videos. The detection of daily living activities is challenging due to their long temporal components, low inter-class variation and high intra-class variation. To tackle these challenges, recent approaches based on Temporal Convolutional Networks (TCNs) have been proposed. Such methods can capture long-term temporal patterns using a hierarchy of temporal convolutional filters, pooling and upsampling steps. However, convolutional networks process only a local neighborhood across time, which makes TCNs inefficient at modeling the long-range dependencies between these temporal patterns of the video. In this paper, we propose the Self-Attention Temporal Convolutional Network (SA-TCN), which is able to capture both complex activity patterns and their dependencies within long-term untrimmed videos. We evaluate our proposed model on the DAily Home LIfe Activity (DAHLIA) and Breakfast datasets, and our method achieves state-of-the-art performance on both.
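
    The core idea lends itself to a compact illustration. Below is a minimal PyTorch sketch of a TCN encoder-decoder with a self-attention block in between, in the spirit of SA-TCN; the layer sizes, kernel sizes and attention configuration are illustrative assumptions, not the authors' exact architecture.

    import torch
    import torch.nn as nn

    class SelfAttentionTCN(nn.Module):
        def __init__(self, in_dim, hid_dim, n_classes):
            super().__init__()
            # Temporal convolutions capture local patterns across time.
            self.tcn = nn.Sequential(
                nn.Conv1d(in_dim, hid_dim, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool1d(2),               # temporal pooling
                nn.Conv1d(hid_dim, hid_dim, kernel_size=3, padding=1),
                nn.ReLU(),
            )
            # Self-attention relates temporal patterns regardless of distance.
            self.attn = nn.MultiheadAttention(hid_dim, num_heads=4, batch_first=True)
            self.up = nn.Upsample(scale_factor=2)  # restore temporal resolution
            self.cls = nn.Conv1d(hid_dim, n_classes, kernel_size=1)

        def forward(self, x):                  # x: (batch, in_dim, time)
            h = self.tcn(x)                    # (batch, hid_dim, time/2)
            t = h.transpose(1, 2)              # (batch, time/2, hid_dim)
            a, _ = self.attn(t, t, t)          # attend over the whole video
            h = self.up(a.transpose(1, 2))     # (batch, hid_dim, time)
            return self.cls(h)                 # frame-wise class scores

    scores = SelfAttentionTCN(2048, 256, 8)(torch.randn(2, 2048, 64))
    print(scores.shape)                        # torch.Size([2, 8, 64])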

    LAC: Latent Action Composition for Skeleton-based Action Segmentation

    Full text link
    Skeleton-based action segmentation requires recognizing composable actions in untrimmed videos. Current approaches decouple this problem by first extracting local visual features from skeleton sequences and then processing them with a temporal model to classify frame-wise actions. However, their performance remains limited, as the visual features cannot sufficiently express composable actions. In this context, we propose Latent Action Composition (LAC), a novel self-supervised framework that learns from synthesized composable motions for skeleton-based action segmentation. LAC is built around a novel generation module for synthesizing new sequences. Specifically, we design a linear latent space in the generator to represent primitive motion; new composed motions can be synthesized by simply performing arithmetic operations on the latent representations of multiple input skeleton sequences. LAC leverages such synthesized sequences, which have large diversity and complexity, to learn visual representations of skeletons in both the sequence and frame spaces via contrastive learning. The resulting visual encoder has high expressive power and can be effectively transferred to action segmentation tasks by end-to-end fine-tuning, without the need for additional temporal models. In a study focusing on transfer learning, we show that representations learned by pre-trained LAC outperform the state of the art by a large margin on the TSU, Charades and PKU-MMD datasets. (ICCV 2023)
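
    As a toy illustration of composing motions by latent arithmetic, the sketch below encodes two skeleton sequences with a small autoencoder and adds their latent codes before decoding. The GRU architecture, joint layout and latent size are hypothetical stand-ins for LAC's actual generator; only the arithmetic-in-latent-space idea is taken from the abstract.

    import torch
    import torch.nn as nn

    J, D = 17, 3                          # joints, coordinates (assumed layout)

    class MotionAutoencoder(nn.Module):
        def __init__(self, latent=128):
            super().__init__()
            self.enc = nn.GRU(J * D, latent, batch_first=True)
            self.dec = nn.GRU(latent, J * D, batch_first=True)

        def encode(self, seq):            # seq: (batch, time, J*D)
            _, h = self.enc(seq)
            return h[-1]                  # (batch, latent)

        def decode(self, z, t):           # repeat the code over t frames
            return self.dec(z.unsqueeze(1).expand(-1, t, -1))[0]

    ae = MotionAutoencoder()
    walk = torch.randn(1, 60, J * D)      # e.g. a "walking" skeleton sequence
    drink = torch.randn(1, 60, J * D)     # e.g. a "drinking" skeleton sequence
    # Arithmetic in the linear latent space yields a composed motion.
    z = ae.encode(walk) + ae.encode(drink)
    composed = ae.decode(z, t=60)         # (1, 60, J*D) synthesized sequence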

    PDAN: Pyramid Dilated Attention Network for Action Detection

    Get PDF
    Handling long and complex temporal information is an important challenge for action detection tasks. This challenge is further aggravated by densely distributed actions in untrimmed videos. Previous action detection methods fail to select the key temporal information in long videos. To this end, we introduce the Dilated Attention Layer (DAL). Compared to a standard temporal convolution layer, DAL allocates attentional weights to the local frames in its kernel, which enables it to learn better local representations across time. Furthermore, we introduce the Pyramid Dilated Attention Network (PDAN), which is built upon DAL. With the help of multiple DALs with different dilation rates, PDAN can model short-term and long-term temporal relations simultaneously by focusing on local segments at low and high temporal receptive fields. This property enables PDAN to handle complex temporal relations between different action instances in long untrimmed videos. To corroborate the effectiveness and robustness of our method, we evaluate it on three densely annotated, multi-label datasets: MultiTHUMOS, Charades and Toyota Smarthome Untrimmed (TSU). PDAN outperforms previous state-of-the-art methods on all three datasets.
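
    A rough sketch of what a dilated attention layer could look like follows: for each frame, attend over the k dilated neighbors that a dilated convolution would see, then stack such layers with growing dilation rates to form the pyramid. Kernel size, dilation schedule and dimensions are assumptions, not the published implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DilatedAttentionLayer(nn.Module):
        def __init__(self, dim, kernel=3, dilation=1):
            super().__init__()
            self.kernel, self.dilation = kernel, dilation
            self.score = nn.Conv1d(dim, kernel, kernel_size=1)  # per-frame kernel weights

        def forward(self, x):                       # x: (batch, dim, time)
            pad = self.dilation * (self.kernel - 1) // 2
            # Gather the k dilated neighbors of every frame: (B, dim, k, T)
            span = 1 + self.dilation * (self.kernel - 1)
            nb = F.pad(x, (pad, pad)).unfold(2, span, 1)
            nb = nb[..., ::self.dilation].permute(0, 1, 3, 2)
            w = torch.softmax(self.score(x), dim=1)          # (B, k, T)
            return (nb * w.unsqueeze(1)).sum(dim=2)          # attention-weighted sum

    # Pyramid: one DAL per dilation rate, covering short- and long-range context.
    pdan = nn.Sequential(*[DilatedAttentionLayer(64, dilation=2 ** i) for i in range(4)])
    out = pdan(torch.randn(2, 64, 128))                      # (2, 64, 128)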

    Human-Scene Network: A Novel Baseline with Self-rectifying Loss for Weakly supervised Video Anomaly Detection

    Get PDF
    Video anomaly detection in surveillance systems with only video-level labels (i.e. weakly supervised) is challenging. This is due to (i) the complex integration of human- and scene-based anomalies, comprising subtle and sharp spatio-temporal cues in real-world scenarios, and (ii) sub-optimal discrimination between normal and anomalous instances under weak supervision. In this paper, we propose a Human-Scene Network that learns discriminative representations by capturing both subtle and strong cues in a dissociative manner. In addition, we propose a self-rectifying loss that dynamically computes pseudo temporal annotations from video-level labels to optimize the Human-Scene Network effectively. The proposed Human-Scene Network optimized with the self-rectifying loss is validated on three publicly available datasets, i.e. UCF-Crime, ShanghaiTech and IITB-Corridor, outperforming recently reported state-of-the-art approaches on five out of the six scenarios considered.
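
    The weak-supervision mechanism can be sketched as follows: derive pseudo frame-level targets from the video-level label by trusting the model's own top-scoring segments in anomalous videos. The top-k rule and loss below are an illustrative simplification, not the paper's exact self-rectifying loss.

    import torch
    import torch.nn.functional as F

    def self_rectifying_loss(frame_scores, video_label, k=8):
        """frame_scores: (T,) sigmoid anomaly scores; video_label: 0 normal, 1 anomaly."""
        target = torch.zeros_like(frame_scores)
        if video_label == 1:
            # Anomalous video: pseudo-label the k highest-scoring frames
            # as positive; normal videos keep an all-negative target.
            target[frame_scores.topk(k).indices] = 1.0
        return F.binary_cross_entropy(frame_scores, target)

    scores = torch.rand(64)                  # per-frame scores from the network
    loss = self_rectifying_loss(scores, video_label=1)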

    Toyota Smarthome: Real-World Activities of Daily Living

    Get PDF
    The performance of deep neural networks is strongly influenced by the quantity and quality of annotated data. Most of the large activity recognition datasets consist of data sourced from the web, which does not reflect the challenges that exist in activities of daily living. In this paper, we introduce a large real-world video dataset for activities of daily living: Toyota Smarthome. The dataset consists of 16K RGB+D clips of 31 activity classes, performed by seniors in a smarthome. Unlike previous datasets, the videos were fully unscripted. As a result, the dataset poses several challenges: high intra-class variation, high class imbalance, simple and composite activities, and activities with similar motion and variable duration. Activities were annotated with both coarse and fine-grained labels. These characteristics differentiate Toyota Smarthome from other datasets for activity recognition. As recent activity recognition approaches fail to address the challenges posed by Toyota Smarthome, we present a novel activity recognition method with an attention mechanism: a pose-driven spatio-temporal attention mechanism through 3D ConvNets. We show that our novel method outperforms state-of-the-art methods on benchmark datasets as well as on the Toyota Smarthome dataset. We release the dataset for research use.
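
    As a hedged sketch of pose-driven attention, the snippet below maps flattened skeleton joints to one weight per spatio-temporal location of a 3D-CNN feature map and modulates the features with those weights. All dimensions, the two-layer pose network and the multiplicative gating are assumptions for illustration, not the paper's exact design.

    import torch
    import torch.nn as nn

    class PoseDrivenAttention(nn.Module):
        def __init__(self, pose_dim, t, h, w):
            super().__init__()
            self.t, self.h, self.w = t, h, w
            # Map the pose input to one attention weight per feature location.
            self.pose_net = nn.Sequential(
                nn.Linear(pose_dim, 256), nn.ReLU(),
                nn.Linear(256, t * h * w),
            )

        def forward(self, feats, pose):            # feats: (B, C, T, H, W)
            attn = torch.softmax(self.pose_net(pose), dim=-1)
            attn = attn.view(-1, 1, self.t, self.h, self.w)
            return feats * attn                    # pose-modulated features

    feats = torch.randn(2, 512, 4, 7, 7)           # e.g. a 3D-CNN feature map
    pose = torch.randn(2, 17 * 3)                  # flattened joint coordinates
    out = PoseDrivenAttention(17 * 3, 4, 7, 7)(feats, pose)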

    HouseCat6D -- A Large-Scale Multi-Modal Category Level 6D Object Pose Dataset with Household Objects in Realistic Scenarios

    Full text link
    Estimating the 6D pose of objects is a major 3D computer vision problem. Following the promising outcomes of instance-level approaches, research is also moving towards category-level pose estimation for more practical application scenarios. However, unlike well-established instance-level pose datasets, available category-level datasets lack annotation quality and pose quantity. We propose the new category-level 6D pose dataset HouseCat6D, featuring 1) multi-modal Polarimetric RGB and Depth (RGBD+P) data, 2) 194 highly diverse objects across 10 household object categories, including 2 photometrically challenging categories, 3) high-quality pose annotation with an error range of only 1.35 mm to 1.74 mm, 4) 41 large-scale scenes with extensive viewpoint coverage and occlusions, 5) a checkerboard-free environment throughout the entire scene, and 6) additionally annotated dense 6D parallel-jaw grasps. Furthermore, we provide benchmark results of state-of-the-art category-level pose estimation networks.

    Advanced stochastic local search methods for automatic design of Boolean network robots

    Get PDF

    Cognitive Abilities in Swarm Robotics: Developing a swarm that can collectively sequence tasks

    No full text
    Can robots of a swarm cooperate to solve together a complex cognitive problem that none of them can solve alone? TS-Swarm is a robot swarm that autonomously sequences tasks at run time and can therefore operate even if the correct order of execution is unknown at design time. The ability to sequence tasks endows robot swarms with unprecedented autonomy and is an important step towards the uptake of swarm robotics in a range of practical applications.

    Autonomous task sequencing in a robot swarm

    No full text
    Robot swarms mimic natural systems in which collective abilities emerge from the interaction of individuals. So far, the swarm robotics literature has focused on the emergence of mechanical abilities (e.g. pushing a heavy object) and simple cognitive abilities (e.g. selecting a path between two alternatives). In this article, we present a robot swarm in which a complex cognitive ability emerged: the swarm was able to collectively sequence tasks whose order of execution was a priori unknown. Because sequencing tasks is a form of planning, albeit a simple one, the robot swarm that we present provides a different perspective on a pivotal debate in the history of artificial intelligence: the debate on planning in robotics. In the proposed swarm, the two robotics paradigms that are traditionally considered antithetical, the deliberative (sense-model-plan-act) and the reactive (sense-act), coexist in a particular way: the ability to plan emerges at the collective level from the interaction of reactive individuals.